Abstract 

Diffusion is a fundamental graph process, underpinning such phenomena as epidemic disease 
contagion and the spread of innovation by word-of-mouth. We address the algorithmic problem 
of finding a set of k initial seed nodes in a network so that the expected size of the resulting 
cascade is maximized, under the standard independent cascade model of network diffusion. The 
promise of such an algorithm lies in applications to viral marketing. However, runtime is of critical 
importance in this endeavor due to the massive size and volatility of the relevant networks. 

Our main result is an algorithm for the influence maximization problem that obtains the 
near-optimal approximation factor of $(1 - \frac{1}{e} - \epsilon)$, for any $\epsilon > 0$, in time $O((m+n)\epsilon^{-3}\log n)$, where 
n and m are the number of vertices and edges in the network. The runtime of our algorithm is 
independent of the number of seeds k and improves upon the previously best-known algorithms, 
which run in time $\Omega(mnk \cdot \mathrm{POLY}(\epsilon^{-1}))$. Importantly, our algorithm is essentially runtime-optimal 
(up to a logarithmic factor), as we establish a lower bound of $\Omega(m+n)$ on the runtime required 
to obtain a constant approximation. 

We then show how to modify our algorithm to allow a provable tradeoff between solution 
quality and runtime. We obtain a $\Theta(\frac{1}{\beta})$-approximation in time $O(n \cdot a(\mathcal{G}) \log^3(n)/\beta)$ for any 
$\beta > 1$, where $a(\mathcal{G})$ denotes the arboricity of the diffusion network $\mathcal{G}$. In particular, for graphs 
with bounded arboricity (as is the case for many models of network formation and empirically 
observed social networks) our algorithm is nearly runtime-optimal (up to logarithmic factors) for 
any fixed seed size k. 

Our approach is based on a novel preprocessing scheme that generates a sparse hypergraph 
representation of the underlying network via sampling. We show that this representation makes 
it possible to efficiently estimate marginal influence in the original diffusion process with very few 
samples, and that the quality of this estimation degrades gracefully with reduced preprocessing 
time. 
1 Introduction 

Diffusion is a fundamental process in the study of complex networks, modeling the spread of disease, 
ideas, or product adoption through a population. The common feature in each case is that local 
interactions between individuals can lead to epidemic outcomes. This is the idea behind word- 
of-mouth advertising, in which information about a product travels via links between individuals; 
see, for example, [3l[8l[ini[IIl[l9l[20l|38]. In recent years, as online social network structure has 
become increasingly visible, applications of diffusion models on networks to advertising have become 
increasingly relevant. A prominent application is a viral marketing campaign which aims to use a 
small number of targeted interventions to initiate cascades of influence that create a global increase 
in product adoption [T7l[T9l[M[27] . 

A large part of this interest focuses on the algorithmic problem of inferring potential influencers 
from network topology. Given a network, how can we determine which individuals should be targeted 
to maximize the magnitude of a resulting cascade [17, 24, 37]? Supposing that there is a limit k 
on the number of nodes to target (e.g. due to advertising budgets), the goal is to efficiently find an 
appropriate set of k nodes with which to "seed" a diffusion process. 

In this paper we develop fast approximation algorithms for the above influence-maximization 
problem, under the standard independent cascade model of influence spread. Before describing these 
results, we first provide some background into the problem at hand. 

The Model: Independent Cascades We adopt the independent cascade (IC) model of diffusion, 
formalized by Kempe et al. [24]. In this model we are given a directed edge-weighted graph $\mathcal{G}$ with 
n nodes and m edges, representing the underlying network. Influence spreads via a random process 
that begins at a set S of seed nodes. Each node, once infected, has a chance of subsequently infecting 
its neighbors: the weight of edge $e = (v, u)$ represents the probability that the process spreads along 
edge e from v to u. If we write $I(S)$ for the (random) number of nodes that are eventually infected 
by this process, then we think of the expectation of $I(S)$ as the influence of set S. Our optimization 
problem, then, is to find a set S maximizing $\mathbb{E}[I(S)]$ subject to $|S| \le k$. 
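
As a rough illustration of the process just described, the following Python sketch runs one cascade and averages many runs to estimate $\mathbb{E}[I(S)]$. The adjacency format (`out_edges` mapping a node to `(neighbor, probability)` pairs) and the function names are assumptions made for this example, not notation from the paper; this naive Monte Carlo estimator is the kind of direct simulation whose cost motivates the faster method developed later.

```python
import random
from collections import deque


def simulate_cascade(out_edges, seeds, rng):
    """Run one independent-cascade process; return the set of infected nodes.

    out_edges maps a node v to a list of (u, p) pairs, meaning edge (v, u)
    has weight p (hypothetical input format chosen for this sketch).
    """
    infected = set(seeds)
    frontier = deque(infected)
    while frontier:
        v = frontier.popleft()
        # v gets a single chance to infect each of its out-neighbors.
        for u, p in out_edges.get(v, []):
            if u not in infected and rng.random() < p:
                infected.add(u)
                frontier.append(u)
    return infected


def monte_carlo_influence(out_edges, seeds, trials=1000, seed=0):
    """Estimate E[I(S)] by averaging cascade sizes over independent trials."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len(simulate_cascade(out_edges, seeds, rng))
    return total / trials
```

With all edge weights equal to 1 the estimate is simply the number of nodes reachable from the seeds; with all weights 0 it is the number of seeds.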

For the corresponding algorithmic problem we assume that the network topology is described in 
the sparse representation of an (arbitrarily ordered) adjacency list for each vertex, as is natural for 
sparse graphs such as social networks. 

The IC model captures the common intuition that influence can spread stochastically through 
a network, much like a disease [16, 19, 23]. Also, as noted by Kempe et al. [23], IC is equivalent 
to a uniform linear threshold model in which each node becomes infected after a certain weighted 
fraction of its neighbors are infected, given that these thresholds are drawn uniformly from the unit 
interval. This thresholding behavior is also a common feature in prominent classic models of influence 
spread [22, 33]. Another important property of the IC model is its computational tractability. Indeed, 
Kempe et al. show that $\mathbb{E}[I(\cdot)]$ is a submodular monotone function [24], and hence the problem 
of maximizing $\mathbb{E}[I(\cdot)]$ can be approximated to within a factor of $(1 - \frac{1}{e} - \epsilon)$ for any $\epsilon > 0$, in 
polynomial time, via a greedy hill-climbing method. In contrast, many other formulations of the 
influence maximization problem have been shown to have strong lower bounds on polynomial-time 
approximability (HIEIETIEHIEH] • 

The greedy approach to maximizing influence in the IC model described above takes time $O(kn)$, 
given oracle access to the function $\mathbb{E}[I(\cdot)]$. In general, however, these influence values must be 
computed from the underlying network topology. A common approach is to simulate the random 
diffusion process many times to estimate influence values for each node. This ultimately¹ leads to 
a total runtime² of $\Omega(mnk \cdot \mathrm{POLY}(\epsilon^{-1}))$. Due to the massive size and temporal volatility of many 

¹After simple optimizations, such as reusing each simulation for multiple nodes. 

²The best implementations appear to have running time $O(mnk \log(n) \cdot \mathrm{POLY}(\epsilon^{-1}))$ [14], though to the best of our 

network dataset instances, this does not provide a fully satisfactory solution: it is crucial for an 
algorithm to scale well with network size [13, 29]. This requirement has spawned a large body of 
work aimed at developing more efficient methods of finding influential individuals in social networks. 
See, for example, p^HT5ll23ll26l [29 t [32 t H0]. However, to date, this work has focused primarily on 
empirical methods; no algorithm has been found to provide provable approximation guarantees in 
time asymptotically less than $\Omega(nmk)$. 

What is the bottleneck in developing better algorithms? One issue is the difficulty of estimating 
the expected influence of a given vertex. As an illustrating example, consider the simple network of 
a star on n node^ in which each edge is bidirectional and has weight p = n~^^^. In this example the 
center node has expected influence n^^^ (infecting each leaf with probability p), whereas each leaf has 
expected influence 0(n^/^). In this example, estimating the influence of a leaf via random sampling 
requires significant effort: one would have to realize the graph $\Omega(\frac{1}{p})$ times to obtain a reasonable 
estimate. As each realization requires linear time (to provide estimates for all leaves), a superlinear 
runtime would be required to estimate influences even if we are willing to tolerate multiplicative 
errors as large as n^/^"*^. A different approach is thus required if we are to achieve running time that 
is close to linear in the network size. 

First Result: A Quasi-Linear Time Algorithm Our first and main result is an algorithm 
for finding $(1 - 1/e - \epsilon)$-approximately optimal seed sets in arbitrary directed networks, which 
runs in time $O((m+n)\epsilon^{-3}\log n)$. Importantly, the runtime of our algorithm is independent of the 
number of seeds k and is essentially runtime-optimal, as we give a lower bound of $\Omega(m+n)$ on 
the runtime required to obtain a constant approximation for this problem (assuming an adjacency 
list representation). We also note that this approximation factor is nearly optimal, as no polytime 
algorithm achieves approximation $(1 - 1/e + \epsilon)$ for any $\epsilon > 0$ unless P = NP [24, 25]. 

Our method is randomized, and it succeeds with probability 3/5; moreover, failure is detectable, 
so this success probability can be amplified through repetition. 

Our algorithm proceeds in two steps. First, we apply random sampling techniques to preprocess 
the network and generate a sparse hypergraph representation that estimates the influence charac- 
teristics of each node. Each hypergraph edge corresponds to a set of individuals that was influenced 
by a randomly selected node in the transpose graph. This preprocessing is done once, resulting in a 
structure of size $O((m+n)\epsilon^{-3}\log(n))$. This hypergraph encodes our influence estimates: for a set 
of nodes S, the total degree of S in the hypergraph is approximately proportional to the influence 
of S in the original graph. We can therefore run a standard greedy algorithm on this hypergraph to 
return a set of size k of approximately maximal total degree. 

To make this approach work one needs to overcome several inherent difficulties. First, we show 
that the marginal influence of a node v is proportional to the probability that v is influenced by 
a randomly chosen node u in the transpose graph. These probabilities can be estimated more efficiently 
than influence itself. In particular, this transpose-graph formulation simplifies the process of 
estimating marginal influence, so that we need not repeat the estimation procedure when considering 
different partial solutions. This results in substantial savings in runtime. 

The next difficulty to overcome is the stringent runtime constraint: we must construct our 
hypergraph in time $O((m+n)\epsilon^{-3}\log(n))$. We show that this number of steps suffices to approximate 
the influence of each set of nodes in the graph. This approximation comes, for each set, 
as a probabilistic guarantee with confidence $1 - 1/\mathrm{POLY}(n)$. Finally, in order to prevent errors 
from accumulating when applying the greedy algorithm to the hypergraph, it is important that our 
estimator for influence (i.e. total hypergraph degree) is itself a monotone submodular function. 

knowledge a formal analysis of this runtime has not appeared in the literature. 

³Of course, the star graph is simple enough that there are obvious alternative methods for finding appropriate seed 
sets; we present it merely to illustrate the potential runtime requirements of estimating node influences directly. 
Our algorithm accesses the network structure in a very limited way, and is therefore implementable 
in a wide array of graph access models. The only operations used by our algorithm are 
accessing a random vertex and traversing the edges incident to a previously-accessed vertex. In 
particular, our algorithm falls within the jump and crawl paradigm of [7]. 

Second Result: A Sublinear Time Algorithm We extend our approximation algorithm to 
allow a provable tradeoff between runtime and approximation quality. Given a network $\mathcal{G}$ with 
arboricity* $a(\mathcal{G})$ and a parameter $\beta > 1$, our algorithm attains approximation $\Theta(\frac{1}{\beta})$ in time 
$O(n \cdot a(\mathcal{G}) \log^3(n)/\beta)$. To the best of our knowledge, ours is the first algorithm with such a tradeoff 
between runtime and approximation quality. 

In particular, on networks of bounded arboricity, our algorithm finds a node of approximately 
maximal influence in sublinear time when $\beta = \omega(\log^3(n))$. We note that many rich classes of 
graphs have bounded arboricity, including planar graphs, graphs with small maximum in-degree or 
maximum out-degree, other graphs with bounded treewidth (as the treewidth is at most twice the 
arboricity), many models of network formation and empirically observed social and technological 
networks [31111^. 

We provide a lower bound of $\Omega((m+n)/\beta \cdot \min\{\beta, k\})$ on the runtime needed to obtain an 
$O(\beta)$-approximation. Thus, for networks with bounded arboricity, our algorithm is essentially runtime- 
optimal (up to logarithmic factors) given any fixed seed size k. Our method is randomized, and it 
succeeds with probability 3/5; moreover, we show that this probability can be amplified through 
repetition. 

The intuition behind our modified algorithm is that a tradeoff between execution time and 
approximation factor can be achieved by constructing fewer edges in our hypergraph representation. 
Given an upper bound on runtime, we can build edges until that time has expired, then run the 
influence maximization algorithm using the resulting (impoverished) hypergraph. We show that this 
approach generates a solution whose quality degrades gracefully with the preprocessing time, with 
an important caveat. If there are many individuals with high influence, it may be that a reduction 
in runtime prevents us from achieving enough concentration to estimate the influence of any node. 
If so, the highest-degree node(s) in the constructed hypergraph will not necessarily be the nodes of 
highest influence in the original graph. However, in this case, the fact that many individuals have 
high influence enables an alternative approach: a node chosen at random, according to the degree 
distribution of nodes in the hypergraph representation, will have high influence with high probability. 

Given the above, our algorithm will proceed by constructing two possible seed sets: one using 
the greedy algorithm applied to the constructed hypergraph, and the other is a singleton selected at 
random according to the hypergraph degree distribution. To decide between the two, we design a 
procedure for efficiently estimating the influence of a given set, up to a maximum of $n/\beta$. We then 
return the set with higher tested influence. 
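
The second candidate, a singleton drawn according to the hypergraph degree distribution, can be sketched as follows. The representation of the hypergraph as a list of node sets, and the `tested_influence` callback standing in for the testing procedure, are assumptions made for illustration only.

```python
import random


def sample_by_degree(hyperedges, rng):
    """Pick a node with probability proportional to its hypergraph degree."""
    # Each (hyperedge, member) incidence contributes one entry, so a node of
    # degree d appears d times and is chosen with probability d / (sum of degrees).
    incidences = [v for edge in hyperedges for v in edge]
    return rng.choice(incidences)


def pick_candidate(greedy_set, singleton, tested_influence):
    """Return whichever candidate scores higher under tested_influence.

    tested_influence is a hypothetical callback; the paper's actual test is
    a separate procedure that is not reproduced here.
    """
    a = tested_influence(greedy_set)
    b = tested_influence(singleton)
    return greedy_set if a >= b else singleton
```

Sampling from the flattened incidence list avoids computing degrees explicitly; it costs time linear in the total size of the hyperedges.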

Our test for estimating influence proceeds by repeatedly constructing depth-first trees of various 
sizes, rooted at the nodes to be tested. This is done in a careful way in order to keep the overall 
runtime small, which depends on the density of the densest subgraph in the network: if the nodes 
to be tested lie in a particularly dense region of the graph, extra time may be required to build 
partial spanning trees, even if the graph is sparse on average. This dependency is what motivates 
the arboricity term in the approximation factor of the algorithm. 

Finally, we emphasize that our solution concept is to return a set with high expected influence, 
with respect to the influence diffusion process, with probability greater than 1/2 over randomness in 
the approximation algorithm. A potential relaxation would be to develop an algorithm that returns 
a set with high expected influence, where the expectation is with respect to both the diffusion process 

*The arboricity of a network is the minimum number of spanning forests needed to cover all edges [35]. 
and the randomness in the approximation algorithm. For this weaker notion of approximability, the 
final testing phase of the algorithm described above is unnecessary: we could simply return one of our 
two potential solutions at random. This modified algorithm would have a runtime of Q^ ("+"')^°g ("■) ^^ 
dropping the dependence on arboricity. As the lower bound of $\Omega((m+n)/\beta \cdot \min\{\beta, k\})$ holds even 
for this relaxed solution concept, this algorithm is nearly optimal (up to logarithmic factors) for 
arbitrary networks, given a fixed k. While we feel that the original (stronger) approximation notion 
is more relevant, this alternative formulation may be of interest in cases where variability in solution 
quality can be tolerated. 



1.1 Related Work 

Models of influence spread in networks, covering both cascade and threshold phenomena, are 
well-studied in the sociology and marketing literature [19, 22, 38]. The problem of finding the most 
influential set of nodes to target for a diffusive process was first posed by Domingos and 
Richardson [17, 37]. A formal development of the IC model, along with a greedy algorithm based upon 
submodular maximization, was given by Kempe et al. [24]. Many subsequent works have studied the 
nature of diffusion in online social networks, using empirical data to estimate influence probabilities 
and infer network topology; see p m iSO t lSTj. 

It has been shown that many alternative formulations of the influence maximization problem are 
computationally difficult. The problem of finding, in a linear threshold model, a set of minimal size 
that influences the entire network was shown to be inapproximable within $O(n^{1-\epsilon})$ by Chen [12]. 
The problem of determining influence spread given a seed set in the IC model is #P-hard [13]. 

There has been a line of work aimed at improving the runtime of the algorithm by Kempe et 
al. [24]. These have focused largely on heuristics, such as assuming that all nodes have relatively 
low influence or that the input graph is clustered [13, 15, 26, 40], as well as empirically-motivated 
implementation improvements [14, 29]. One particular approach of note involves first attempting to 
sparsify the input graph, then estimating influence on the reduced network [15, 32]. Unfortunately, 
these sparsification problems are shown to be computationally intractable in general. 

Various alternative formulations of influence spread as a submodular process have been proposed 
and analyzed in the literature [25, 34], including those that include interactions between multiple 
diffusive processes [5, 21]. We focus specifically on the IC model, and leave open the question of 
whether our methods can be extended to apply to these alternative models. 

The influence estimation problem shares some commonality with the problems of local graph 
partitioning, as well as estimating pagerank and personalized pagerank vectors [HEKHIEH]. These 
problems admit local algorithms based on sampling short random walks. These methods do not 
seem directly applicable to influence maximization due to the inherently non-local nature of influence 
cascades. 



2 Model and Preliminaries 

The Independent Cascade Model In the independent cascade (IC) model, influence spreads 
via an edge-weighted directed graph $\mathcal{G}$. An infection begins at a set S of seed nodes, and spreads 
through the network in rounds. Each infected node v has a single chance, upon first becoming 
infected, of subsequently infecting its neighbors. Each directed edge $e = (v, u)$ has a weight $p_e \in [0, 1]$ 
representing the probability that the process spreads along edge e to node u in the round following 
the round in which v was first infected. 

As noted in [23], the above process has the following equivalent description. We can interpret 
$\mathcal{G}$ as a distribution over unweighted directed graphs, where each edge e is independently realized 
with probability $p_e$. If we realize a graph G according to this probability distribution, then we can 
associate the set of infected nodes in the original process with the set of nodes reachable from seed 
set S in G. We will make use of this alternative formulation of the IC model throughout the paper. 
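
The equivalent description can be sketched directly: first draw an unweighted graph by flipping one coin per edge, then compute plain reachability. The edge-list format `(v, u, p)` and the function names below are assumptions made for this example.

```python
import random


def realize_graph(weighted_edges, rng):
    """Draw an unweighted graph: keep each edge (v, u, p) independently with probability p."""
    live = {}
    for v, u, p in weighted_edges:
        if rng.random() < p:
            live.setdefault(v, []).append(u)
    return live


def reachable(live, seeds):
    """Return the set of nodes reachable from the seeds in the realized graph."""
    seen = set(seeds)
    stack = list(seen)
    while stack:
        v = stack.pop()
        for u in live.get(v, []):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen
```

The size of `reachable(realize_graph(...), S)` has the same distribution as the number of nodes infected by the round-based process started at S.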



Notation We let m and n denote the number of edges and nodes, respectively, in the weighted 
directed graph $\mathcal{G}$. We write $G \sim \mathcal{G}$ to mean that G is drawn from the random graph distribution $\mathcal{G}$. 
Given a set S of vertices and an (unweighted) graph G, write $C_G(S)$ for the set of nodes reachable from 
S in G. When G is drawn from $\mathcal{G}$, we will refer to this as the set of nodes influenced by S. We write 
$I_G(S) = |C_G(S)|$ for the number of nodes influenced by S, which we call the influence of S in G. We 
write $\mathbb{E}_{\mathcal{G}}[I(S)] = \mathbb{E}_{G \sim \mathcal{G}}[I_G(S)]$ for the expected influence of S in $\mathcal{G}$. 

Given two sets of nodes S and W, we write $C_G(S|W)$ for the set of nodes reachable from S but 
not from W. That is, $C_G(S|W) = C_G(S) \setminus C_G(W)$. As before, we write $I_G(S|W) = |C_G(S|W)|$; we 
refer to this as the marginal influence of S given W. The expected marginal influence of S given W 
is $\mathbb{E}_{\mathcal{G}}[I(S|W)] = \mathbb{E}_{G \sim \mathcal{G}}[I_G(S|W)]$. 

In general, a vertex in the subscript of an expectation or probability denotes the vertex being 
selected uniformly at random from the set of vertices of $\mathcal{G}$. For example, $\mathbb{E}_{v,\mathcal{G}}[I(v)]$ is the average, 
over all graph nodes v, of the expected influence of v. 

For a given graph G, define $G^T$ to be the transpose graph of G: $(u, v) \in G$ iff $(v, u) \in G^T$. We 
apply this notation to both weighted and unweighted graphs. 



The Influence Maximization Problem Given graph $\mathcal{G}$ and integer $k \ge 1$, the influence 
maximization problem is to find a set S of at most k nodes maximizing the value of $\mathbb{E}_{\mathcal{G}}[I(S)]$. For 
$\beta > 1$, we say that a particular set of nodes T with $|T| \le k$ is a $\frac{1}{\beta}$-approximation to the influence 
maximization problem if $\mathbb{E}_{\mathcal{G}}[I(T)] \ge \frac{1}{\beta} \max_{S : |S| = k} \mathbb{E}_{\mathcal{G}}[I(S)]$. 

We assume that graph $\mathcal{G}$ is provided in adjacency list format, with the neighbors of a given vertex 
v ordered arbitrarily. 



A Simulation Primitive Our algorithms will make use of a primitive that realizes an instance 
of the nodes influenced by a given vertex u in weighted graph $\mathcal{G}$, and returns this set of nodes. 
Conceptually, this is done by realizing some $G \sim \mathcal{G}$ and traversing $C_G(u)$. 

Let us briefly discuss the implementation of such a primitive. Given node u, we can run a 
depth-first search in $\mathcal{G}$ starting at node u. Before traversing any given edge e, we perform a random test: 
with probability $p_e$ we traverse the edge as normal, and with probability $1 - p_e$ we do not traverse 
edge e and ignore it from that point onward. The set of nodes traversed in this manner is equivalent 
to $C_G(u)$ for $G \sim \mathcal{G}$, due to deferred randomness. We then return the set of nodes traversed. The 
runtime of this procedure is precisely the sum of the degrees (in $\mathcal{G}$) of the vertices in $C_G(u)$. 

We can implement this procedure for a traversal of $\mathcal{G}^T$, rather than $\mathcal{G}$, by following in-links 
rather than out-links in our tree traversal. Also, we will sometimes wish to run this procedure with 
an upper bound on the number of nodes to traverse; in this case we simply abort the depth-first 
traversal when the bound has been reached and return the set of nodes explored up to that point. 
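
A minimal sketch of this primitive on the transpose graph is given below. It assumes the graph is stored as in-link lists (`in_edges[v]` holds pairs `(w, p)` for each original edge from w to v); that format, the function name, and the choice to count one step per examined edge are assumptions of this example. Each node is expanded once, so each edge's coin is flipped at most once, which is the deferred-randomness argument above.

```python
import random


def simulate_reverse(in_edges, u, rng, max_nodes=None):
    """Traverse a realization of the transpose graph from u.

    Returns (nodes, steps): the nodes discovered and the number of edges examined.
    If max_nodes is given, the traversal aborts once that many nodes are found.
    """
    visited = {u}
    steps = 0
    if max_nodes is not None and len(visited) >= max_nodes:
        return visited, steps
    stack = [u]
    while stack:
        v = stack.pop()
        for w, p in in_edges.get(v, []):
            steps += 1  # one step per edge examined
            # Flip the coin for this edge only now, when it is first needed.
            if w not in visited and rng.random() < p:
                visited.add(w)
                if max_nodes is not None and len(visited) >= max_nodes:
                    return visited, steps
                stack.append(w)
    return visited, steps
```

The stack-based loop visits the same node set as a recursive depth-first search; only the visiting order may differ.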



3 An Approximation Algorithm for Influence Maximization 

In this section we present an algorithm for the influence maximization problem on arbitrary directed 
graphs. Our algorithm returns a $(1 - \frac{1}{e} - \epsilon)$-approximation to the influence maximization problem, 
with success probability 3/5, in time $O((m+n)\epsilon^{-3}\log n)$. We note that this algorithm is a simplification 
of a more general version that permits a tradeoff between runtime and approximation, which 
appears in Section [H 
Algorithm 1 Maximize Influence 

Require: Precision parameter $\epsilon \in (0, 1)$, directed edge-weighted graph $\mathcal{G}$. 
1: $R \leftarrow 72(m+n)\epsilon^{-3}\log(n)$ 
2: $\mathcal{H} \leftarrow$ BuildHypergraph(R) 
3: return BuildSeedSet($\mathcal{H}$, k) 

BuildHypergraph(R): 
1: Initialize $\mathcal{H} = (V, \emptyset)$. 
2: repeat 
3:     Choose node u from $\mathcal{G}$ uniformly at random. 
4:     Simulate influence spread, starting from u, in $\mathcal{G}^T$. Let T be the set of nodes discovered. 
5:     Add T to the edge set of $\mathcal{H}$. 
6: until R steps have been taken in total by the simulation process. 
7: return $\mathcal{H}$ 

BuildSeedSet($\mathcal{H}$, k): 
1: for $i = 1, \ldots, k$ do 
2:     $v_i \leftarrow \arg\max_v \{\deg_{\mathcal{H}}(v)\}$ 
3:     Remove $v_i$ and all incident edges from $\mathcal{H}$ 
4: return $\{v_1, \ldots, v_k\}$ 
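
The BuildHypergraph loop can be sketched in Python as follows. The in-link list format, the function names, and two accounting choices are assumptions of this sketch rather than details fixed by the pseudocode: each sampled root is charged one step in addition to one step per examined edge, and the simulation in progress when the budget runs out is allowed to finish.

```python
import math
import random


def runtime_budget(n, m, eps):
    """R = 72 (m + n) eps^-3 log(n), as on line 1 of Algorithm 1."""
    return 72 * (m + n) * math.log(n) / eps ** 3


def build_hypergraph(nodes, in_edges, budget, rng):
    """Sample hyperedges by reverse simulation until `budget` steps are used.

    in_edges[v] lists (w, p) for each original edge from w to v (assumed format).
    Returns a list of node sets, one per hyperedge.
    """
    hyperedges = []
    steps = 0
    while steps < budget:
        root = rng.choice(nodes)
        steps += 1  # charge one step for drawing the root
        reached = {root}
        stack = [root]
        while stack:
            v = stack.pop()
            for w, p in in_edges.get(v, []):
                steps += 1  # one step per edge examined
                if w not in reached and rng.random() < p:
                    reached.add(w)
                    stack.append(w)
        hyperedges.append(reached)
    return hyperedges
```

Charging a step per root guarantees termination even on a graph with no edges, and matches the $1 + m_{G^T}(u)$ cost per iteration used in the analysis.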



The algorithm is described formally as Algorithm 1, but let us begin by describing our construction 
informally. Our approach proceeds in two steps. The first step, BuildHypergraph, generates 
a sparse, randomized hypergraph representation $\mathcal{H}$ of our underlying graph $\mathcal{G}$. This is done by 
repeatedly simulating the influence spread process on the transpose of the input graph, $\mathcal{G}^T$. This 
simulation process is performed as described in Section 2: we begin at a random node u and proceed 
via depth-first search, where each encountered edge e is traversed independently with probability 
$p_e$. The set of nodes encountered becomes an edge in $\mathcal{H}$. We then repeat this process, generating 
multiple hyperedges. The BuildHypergraph subroutine takes as input a bound R on its runtime; 
we continue building edges until a total of R steps has been taken by the simulation process. (Note 
that the number of steps taken by the process is equal to the number of edges considered by the 
depth-first search process.) Once R steps have been taken in total over all simulations, we return 
the resulting hypergraph. 

In the second step, BuildSeedSet, we use our hypergraph representation to construct our output 
set. This is done by repeatedly choosing the node with highest degree in H, then removing that 
node and all incident edges from H. The resulting set of k nodes is the generated seed set. 
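
The BuildSeedSet step amounts to a greedy maximum-coverage pass over the hyperedges. The sketch below assumes the hypergraph is given as a list of node sets; the incidence index and the tie-breaking rule (first node encountered) are choices made for this example.

```python
def build_seed_set(hyperedges, k):
    """Greedily pick up to k nodes of highest remaining hypergraph degree."""
    # incident[v] = ids of the not-yet-removed hyperedges containing v
    incident = {}
    for i, edge in enumerate(hyperedges):
        for v in edge:
            incident.setdefault(v, set()).add(i)
    seeds = []
    for _ in range(k):
        if not incident:
            break
        v = max(incident, key=lambda x: len(incident[x]))
        covered = incident.pop(v)
        seeds.append(v)
        # Removing v's hyperedges lowers the degree of every other member.
        for i in covered:
            for w in hyperedges[i]:
                if w in incident:
                    incident[w].discard(i)
    return seeds
```

For instance, on the hyperedges `[{0, 1}, {0, 2}, {0, 3}, {4}, {4, 5}]` with k = 2, node 0 (degree 3) is chosen first, after which node 4 (degree 2) has the highest remaining degree.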

While Algorithm 1 is relatively simple to describe, it is not at all obvious why such an algorithm 
should work well under its imposed, stringent time constraints. We now turn to a detailed 
analysis of Algorithm 1. Fix k and a weighted directed graph $\mathcal{G}$. Let $\mathrm{OPT} = \max_{S : |S| = k} \{\mathbb{E}_{\mathcal{G}}[I(S)]\}$, 
the maximum expected influence of a set of k nodes. 

Our goal is to bound the approximation factor of the set returned by Algorithm 1. 

Theorem 3.1. Fix $\epsilon > 0$. Algorithm 1 returns a set S with $\mathbb{E}_{\mathcal{G}}[I(S)] \ge (1 - \frac{1}{e} - \epsilon)\mathrm{OPT}$, with 
probability at least 3/5, and runs in time $O\left(\frac{(m+n)\log(n)}{\epsilon^3}\right)$. 

The idea behind the proof of Theorem 3.1 is as follows. First, we observe that the influence of 
a set of nodes S is precisely n times the probability that a randomly selected node u influences any 
node from S in the transpose graph $\mathcal{G}^T$. 
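Observation 3.2. For each subset of nodes $S \subseteq \mathcal{G}$, 

$$\mathbb{E}_{\mathcal{G}}[I(S)] = n \Pr_{u, G \sim \mathcal{G}}\left[S \cap C_{G^T}(u) \neq \emptyset\right].$$

Proof. 

$$
\begin{aligned}
\mathbb{E}_{G \sim \mathcal{G}}[I_G(S)] &= \sum_{u \in \mathcal{G}} \Pr_{G \sim \mathcal{G}}[\exists v \in S \text{ such that } u \in C_G(v)] \\
&= n \Pr_{u, G \sim \mathcal{G}}[\exists v \in S \text{ such that } v \in C_{G^T}(u)] \\
&= n \Pr_{u, G \sim \mathcal{G}}[S \cap C_{G^T}(u) \neq \emptyset].
\end{aligned}
$$

□ 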



Observation 3.2 implies that we can estimate $\mathbb{E}_{\mathcal{G}}[I(S)]$ by estimating the probability of the event 
$S \cap C_{G^T}(u) \neq \emptyset$. The degree of a node v in $\mathcal{H}$ is precisely the number of times we observed that v was 
influenced by a randomly selected node u. We can therefore think of $\mathcal{H}$ as encoding an approximation 
to the influence function in graph $\mathcal{G}$. 
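
The identity in Observation 3.2 can be checked exactly on a tiny graph by enumerating every realization. The sketch below does this by brute force; the edge-list format `(v, u, p)` and the function names are assumptions of the example, and the enumeration is exponential in the number of edges, so it is only meant as a sanity check.

```python
from itertools import product


def reach(adj, sources):
    """Nodes reachable from `sources` in an unweighted adjacency dict."""
    seen = set(sources)
    stack = list(seen)
    while stack:
        v = stack.pop()
        for w in adj.get(v, []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def check_observation(n, edges, S):
    """Return (E[I(S)], n * Pr[S meets the reverse reachable set of a random u])."""
    seed_set = set(S)
    expected_influence = 0.0
    pr_hit = 0.0
    for mask in product([0, 1], repeat=len(edges)):
        prob = 1.0
        fwd, rev = {}, {}
        for keep, (v, u, p) in zip(mask, edges):
            prob *= p if keep else (1.0 - p)
            if keep:
                fwd.setdefault(v, []).append(u)
                rev.setdefault(u, []).append(v)  # transpose edge
        expected_influence += prob * len(reach(fwd, seed_set))
        hits = sum(1 for u in range(n) if reach(rev, [u]) & seed_set)
        pr_hit += prob * hits / n
    return expected_influence, n * pr_hit
```

On the three-node graph with edges `(0, 1, 0.5)`, `(1, 2, 0.5)`, `(0, 2, 0.25)` and S = {0}, both sides evaluate to 1.9375.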

We now show that the algorithm takes enough samples to accurately estimate the influences of the 
nodes in the network. This requires two steps. First, we show that runtime $R = 72(m+n)\epsilon^{-3}\log(n)$ 
is enough to build a sufficiently rich hypergraph structure, with high probability over the random 
outcomes of the influence cascade model. 

Lemma 3.3. Hypergraph $\mathcal{H}$ will contain at least $\frac{24 n \log(n)}{\mathrm{OPT}\,\epsilon^3}$ edges, with probability at least $\frac{2}{3}$. 

Proof. Given a vertex u and an edge $e = (v, w)$, consider the random event indicating whether edge 
e is checked as part of the process of growing a depth-first search rooted at u in the IC process 
corresponding to graph $G \sim \mathcal{G}^T$. Note that edge e is checked if and only if node v is influenced by 
node u in this invocation of the IC process. In other words, edge $e = (v, w)$ is checked as part of the 
influence spread process on line 4 of BuildHypergraph if and only if $v \in T$. Write $m_{G^T}(u)$ for the 
random variable indicating the number of edges that are checked as part of building the influence 
set T starting at node u in $G^T$. 

Let $X = \frac{24 n \log(n)}{\mathrm{OPT}\,\epsilon^3}$ for notational convenience. Consider the first (up to) X iterations of the loop 
on lines 2-6 of BuildHypergraph. Note that $\mathcal{H}$ will have at least X edges if the total runtime of the 
first X iterations is at most R. The expected runtime of the algorithm over these iterations is 

$$
\begin{aligned}
X \cdot \mathbb{E}_{u, G \sim \mathcal{G}}\left[1 + m_{G^T}(u)\right]
&= X + \frac{X}{n}\, \mathbb{E}_{G \sim \mathcal{G}}\Big[\sum_{u} m_{G^T}(u)\Big] \\
&= X + \frac{X}{n} \sum_{e = (v,w) \in \mathcal{G}^T} \mathbb{E}_{G \sim \mathcal{G}}\big[\,|\{u : v \in C_{G^T}(u)\}|\,\big] \\
&= X + \frac{X}{n} \sum_{e = (v,w) \in \mathcal{G}^T} \mathbb{E}_{\mathcal{G}}[I(v)] \\
&\le \frac{24 n \log(n)}{\epsilon^3} + \frac{24 \log(n)}{\mathrm{OPT}\,\epsilon^3} \sum_{e = (v,w) \in \mathcal{G}^T} \mathrm{OPT} \\
&= \frac{24(m+n)\log(n)}{\epsilon^3}.
\end{aligned}
$$

Here, the second line follows by noting that an edge $(v, w) \in \mathcal{G}^T$ is checked 
as part of $m_{G^T}(u)$ if and only if v appears in $C_{G^T}(u)$; the third line holds because the nodes u with 
$v \in C_{G^T}(u)$ are exactly the nodes influenced by v; and the inequality uses $\mathrm{OPT} \ge 1$ and $\mathbb{E}_{\mathcal{G}}[I(v)] \le \mathrm{OPT}$. 

Thus, by the Markov inequality, the probability that the runtime over the first X iterations is 
greater than $R = 72(m+n)\epsilon^{-3}\log(n)$ is at most $\frac{1}{3}$. The probability that at least X edges are present 
in hypergraph $\mathcal{H}$ is therefore at least $\frac{2}{3}$, as required. □ 



Next, we show that the resulting hypergraph is of sufficient size to estimate the influence of each 
set, up to an additive error that shrinks with $\epsilon$, with probability $1 - 1/\mathrm{POLY}(n)$. We then show 
that such estimation guarantees suffice to find good approximations to the influence maximization 
problem. 

Write $m(\mathcal{H})$ and $\deg_{\mathcal{H}}(S)$ for the number of edges of $\mathcal{H}$ and the number of edges from $\mathcal{H}$ incident 
with a node from S, respectively. An important subtlety is that the value of $m(\mathcal{H})$ is a random 
variable, determined by the stopping condition of BuildHypergraph. We must be careful to bound 
the effect on our influence estimation. We show that the value of $m(\mathcal{H})$ is sufficiently concentrated 
that the resulting bias is insignificant. 
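
Concretely, the estimator analyzed below is $\frac{n}{m(\mathcal{H})} \deg_{\mathcal{H}}(S)$. A minimal sketch, assuming the hypergraph is stored as a non-empty list of node sets (the function names are illustrative):

```python
def hypergraph_degree(hyperedges, S):
    """Number of hyperedges incident with at least one node of S."""
    members = set(S)
    return sum(1 for edge in hyperedges if members & set(edge))


def influence_from_hypergraph(hyperedges, n, S):
    """Estimate the expected influence of S as n * deg_H(S) / m(H)."""
    return n * hypergraph_degree(hyperedges, S) / len(hyperedges)
```

For example, with n = 3 and hyperedges `[{0, 1}, {1}, {2}, {0}]`, the set {0} touches two of the four hyperedges, giving an estimate of 1.5.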

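Lemma 3.4. Suppose that $m(\mathcal{H}) \ge \frac{24(m+n)\log(n)}{\mathrm{OPT}\,\epsilon^3}$. Then, for any set of nodes $S \subseteq V$, 

$$\Pr\left[\left|\mathbb{E}_{\mathcal{G}}[I(S)] - \frac{n}{m(\mathcal{H})} \deg_{\mathcal{H}}(S)\right| > \epsilon\,\mathrm{OPT}\right] < \frac{1}{n^3},$$

with probability taken over randomness in $\mathcal{H}$. 

Proof. By assumption we have that $m(\mathcal{H}) \ge 24(m+n)\log(n)/(\mathrm{OPT}\,\epsilon^3)$, and we also know $m(\mathcal{H}) \le 
R = 72(m+n)\log(n)/\epsilon^3$ (since each edge has size at least 1). Fix some arbitrary 

$$T \in \left[\frac{24(m+n)\log(n)}{\mathrm{OPT}\,\epsilon^3},\; R\right].$$

We will study the probability that 

$$\left|\mathbb{E}_{\mathcal{G}}[I(S)] - \frac{n}{m(\mathcal{H})} \deg_{\mathcal{H}}(S)\right| > \epsilon\,\mathrm{OPT}$$

conditional on $m(\mathcal{H}) = T$. 

Suppose first that $\mathbb{E}_{\mathcal{G}}[I(S)] \ge \epsilon\,\mathrm{OPT}$. Let $D_S$ denote the degree of S in $\mathcal{H}$, for notational 
convenience. Thinking of $D_S$ as a random variable, we have that $D_S$ is the sum of $m(\mathcal{H}) = T$ 
identically distributed Bernoulli random variables each with probability $\mathbb{E}_{\mathcal{G}}[I(S)]/n \ge \epsilon\,\mathrm{OPT}/n$, by 
Observation 3.2. In particular, since $m(\mathcal{H}) = T \ge 24(m+n)\log(n)/(\mathrm{OPT}\,\epsilon^3)$, 

$$\mathbb{E}[D_S] \ge m(\mathcal{H}) \frac{\epsilon\,\mathrm{OPT}}{n} \ge 24 \epsilon^{-2} \log n.$$

The Multiplicative Chernoff bound (A.1) then implies that 

$$\Pr\left[D_S < (1 - \epsilon) \frac{m(\mathcal{H})}{n} \mathbb{E}_{\mathcal{G}}[I(S)]\right] = \Pr\left[D_S < (1 - \epsilon)\mathbb{E}[D_S]\right] < e^{-\mathbb{E}[D_S]\epsilon^2/2} \le e^{-12\log(n)} = \frac{1}{n^{12}}.$$

Similarly, we can use the Multiplicative Chernoff bound to conclude that 

$$\Pr\left[D_S > (1 + \epsilon) \frac{m(\mathcal{H})}{n} \mathbb{E}_{\mathcal{G}}[I(S)]\right] < e^{-\mathbb{E}[D_S]\epsilon^2/4} \le e^{-6\log(n)} = \frac{1}{n^{6}}.$$

Since $\mathrm{OPT} \ge \mathbb{E}_{\mathcal{G}}[I(S)]$, a deviation of the estimate by more than $\epsilon\,\mathrm{OPT}$ implies one of these two events. 

Next suppose that $\mathbb{E}_{\mathcal{G}}[I(S)] < \epsilon\,\mathrm{OPT}$, so that $\mathbb{E}[D_S] < \epsilon \cdot \mathrm{OPT} \cdot m(\mathcal{H})/n$. In this case, the 
Multiplicative Chernoff bound implies that 

$$
\begin{aligned}
\Pr\left[D_S > \mathbb{E}[D_S] + \frac{\epsilon\,\mathrm{OPT}}{n} m(\mathcal{H})\right]
&= \Pr\left[D_S > \mathbb{E}[D_S]\left(1 + \frac{\epsilon\,\mathrm{OPT}\, m(\mathcal{H})}{n\,\mathbb{E}[D_S]}\right)\right] \\
&< e^{-\frac{\epsilon\,\mathrm{OPT}}{n} m(\mathcal{H})/2} \le e^{-12\log(n)} = \frac{1}{n^{12}}.
\end{aligned}
$$

Thus, in all cases, the probability that the event of interest occurs is at most $\frac{2}{n^6}$. Since we conditioned 
on $m(\mathcal{H}) = T$, and since there are no more than $R = 72(m+n)\log(n)/\epsilon^3 = o(n^3)$ potential values of T, 
the union bound implies that the unconditional probability that 
$\left|\mathbb{E}_{\mathcal{G}}[I(S)] - \frac{n}{m(\mathcal{H})} \deg_{\mathcal{H}}(S)\right| > \epsilon\,\mathrm{OPT}$ is at most $\frac{1}{n^3}$, as required. □ 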



Finally, we must show that the greedy algorithm applied to $\mathcal{H}$ in BuildSeedSet returns a good 
approximation to the original optimization problem. Recall that, in general, the greedy algorithm 
for submodular function maximization proceeds by repeatedly selecting the singleton with maximal 
contribution to the function value, up to the cardinality constraint. The following lemma shows 
that if one submodular function is approximated sufficiently well by a distribution of submodular 
functions, then applying the greedy algorithm to a function drawn from the distribution yields a 
good approximation with respect to the original. 
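
The greedy procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `greedy_max`, `coverage`, and `TOY_SETS` are hypothetical and do not come from the paper, and a small coverage function stands in for a generic non-decreasing submodular function.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, Set

# Hypothetical toy set system: each element "covers" a few items.
TOY_SETS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}


def coverage(selection: Iterable[str]) -> int:
    """Number of items covered by the selected elements (submodular, non-decreasing)."""
    covered = set()
    for name in selection:
        covered |= TOY_SETS[name]
    return len(covered)


def greedy_max(
    f: Callable[[FrozenSet[Hashable]], float],
    ground: Iterable[Hashable],
    k: int,
) -> Set[Hashable]:
    """Repeatedly add the element with the largest marginal gain, up to k elements."""
    elements = list(dict.fromkeys(ground))  # de-duplicate, keep order
    chosen: Set[Hashable] = set()
    for _ in range(min(k, len(elements))):
        base = f(frozenset(chosen))
        best, best_gain = None, None
        for x in elements:
            if x in chosen:
                continue
            gain = f(frozenset(chosen | {x})) - base
            if best_gain is None or gain > best_gain:
                best, best_gain = x, gain
        chosen.add(best)
    return chosen
```

For example, `greedy_max(coverage, TOY_SETS, 2)` first picks `"c"` (gain 4) and then `"a"` (gain 3). Each round evaluates `f` once per remaining element, which is the cost that the degree-based data structure in BuildSeedSet is designed to avoid.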

Lemma 3.5. Choose $\delta > 0$ and suppose that $f : 2^V \to \mathbb{R}_{\ge 0}$ is a non-decreasing submodular function.
Let $D$ be a distribution over non-decreasing submodular functions with the property that, for all sets
$S$ with $|S| \le k$, $\Pr_{\hat{f} \sim D}[|f(S) - \hat{f}(S)| > \delta] < 1/n^3$. If we write $S_{\hat{f}}$ for the set returned by the greedy
algorithm on input $\hat{f}$, then

\[
\Pr_{\hat{f} \sim D}\left[f(S_{\hat{f}}) < (1 - 1/e)\left(\max_{|S| = k} f(S)\right) - 2\delta\right] < 1/n.
\]



Proof. Choose $S^* \in \operatorname{argmax}_{|S| = k}\{f(S)\}$. With probability at least $1 - 1/n^3$, $\hat{f}(S^*) \ge f(S^*) - \delta$. So,
in particular, $\max_{|S| = k} \hat{f}(S) \ge f(S^*) - \delta$.

We run the greedy algorithm on function $\hat{f}$; let $S_i$ be the set of nodes selected up to and including
iteration $i$ (with $S_0 = \emptyset$). On iteration $i$, we consider each set of the form $S_{i-1} \cup \{x\}$ where $x$ is a
singleton. There are at most $n$ of these sets, and hence the union bound implies that $f$ and $\hat{f}$ differ by
at most $\delta$ on each of these sets, with probability at least $1 - 1/n^2$. In particular, $|f(S_i) - \hat{f}(S_i)| < \delta$.

Taking the union bound over all iterations, we have that $|f(S_k) - \hat{f}(S_k)| < \delta$ with probability at
least $1 - 1/n$. We therefore have



\[
f(S_k) \ge \hat{f}(S_k) - \delta \ge (1 - 1/e)\max_{|S| = k} \hat{f}(S) - \delta \ge (1 - 1/e) f(S^*) - 2\delta,
\]

conditioning on an event of probability $1 - 1/n$. □

We are now ready to complete our proof of Theorem 3.1.

Proof of Theorem 3.1. Lemmas 3.3 and 3.4 together imply that, conditioning on an event of probability
at least 3/5, we will have

\[
\Pr\left[\left|\mathbb{E}_{\mathcal{G}}[I(S)] - \frac{n \cdot \deg_{\mathcal{H}}(S)}{m(\mathcal{H})}\right| > \epsilon\,\mathrm{OPT}\right] < \frac{1}{n^3}
\]

for each $S \subseteq V$. We then apply Lemma 3.5 with $f(S) := \mathbb{E}_{\mathcal{G}}[I(S)]$, $\hat{f}(S) := \frac{n \cdot \deg_{\mathcal{H}}(S)}{m(\mathcal{H})}$ (drawn
from the distribution corresponding to the distribution of $\mathcal{H}$ returned by BuildHypergraph), and $\delta = \epsilon\,\mathrm{OPT}$.
Lemma 3.5 implies that, with probability at least $1 - \frac{1}{n}$, the greedy algorithm applied to $\mathcal{H}$ returns
a set $S$ with $\mathbb{E}_{\mathcal{G}}[I(S)] \ge (1 - 1/e)\mathrm{OPT} - 2\epsilon\,\mathrm{OPT} = (1 - 1/e - 2\epsilon)\mathrm{OPT}$. Noting that this is precisely
the set returned by BuildSeedSet gives the desired bound on the approximation factor (rescaling $\epsilon$
by a factor of 2). Thus the claim holds with probability at least $2/3 - 1/n > 3/5$ (for $n > 20$).

Finally, we argue that our algorithm can be implemented in the appropriate runtime. The fact
that BuildHypergraph executes in the required time follows from the explicit bound on its runtime.
For BuildSeedSet, we will maintain a list of vertices sorted by their degree in $\mathcal{H}$; this will allow us to
repeatedly select the maximum-degree node in constant time. The initial sort takes time $O(n \log n)$.
We must bound the time needed to remove an edge from $\mathcal{H}$ and correspondingly update the sorted
list. We will implement the sorted list as a doubly linked list of groups of vertices, where each
group itself is implemented as a doubly linked list containing all vertices of a given degree (with only
non-empty groups present). Each edge of $\mathcal{H}$ will maintain a list of pointers to its vertices. When
an edge is removed, the degree of each vertex in the edge decreases by 1; we modify the list by
shifting any decremented vertex to the preceding group (creating new groups and removing empty
groups as necessary). Removing an edge from $\mathcal{H}$ and updating the sorted list therefore takes time
proportional to the size of the edge. Since each edge in $\mathcal{H}$ can be removed at most once over all
iterations of BuildSeedSet, the total runtime is at most the sum of node degrees in $\mathcal{H}$, which is at
most $R = O((m+n)\epsilon^{-3}\log(n))$. □
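
The bookkeeping above can be illustrated with a short Python sketch. It is not the paper's implementation: instead of doubly linked lists of groups, it uses an array of sets indexed by degree together with a pointer to the current maximum degree, which gives the same kind of total cost (proportional to the sum of edge sizes plus $n$). The function name `build_seed_set` and the input format (vertices numbered `0..n-1`, hyperedges as collections of vertices) are assumptions made for this example.

```python
from typing import Iterable, List


def build_seed_set(n: int, hyperedges: Iterable[Iterable[int]], k: int) -> List[int]:
    """Greedy selection of up to k vertices by current hypergraph degree."""
    edges = [set(e) for e in hyperedges]
    # incident[v] lists the ids of hyperedges containing v.
    incident: List[List[int]] = [[] for _ in range(n)]
    for e_id, e in enumerate(edges):
        for v in e:
            incident[v].append(e_id)
    degree = [len(inc) for inc in incident]

    # buckets[d] holds the not-yet-chosen vertices whose current degree is d.
    buckets = [set() for _ in range(max(degree, default=0) + 1)]
    for v in range(n):
        buckets[degree[v]].add(v)

    alive = [True] * len(edges)
    top = len(buckets) - 1  # degrees only decrease, so this pointer only moves down
    seeds: List[int] = []
    for _ in range(min(k, n)):
        while top > 0 and not buckets[top]:
            top -= 1
        v = buckets[top].pop()  # a vertex of maximum current degree
        seeds.append(v)
        # Remove every remaining hyperedge containing v and update the others.
        for e_id in incident[v]:
            if not alive[e_id]:
                continue
            alive[e_id] = False
            for u in edges[e_id]:
                if u != v:
                    buckets[degree[u]].discard(u)
                    degree[u] -= 1
                    buckets[degree[u]].add(u)
    return seeds
```

Each hyperedge is deactivated at most once, and deactivating it costs time proportional to its size, mirroring the argument in the proof. Ties between vertices of equal degree are broken arbitrarily in this sketch.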

3.1 Amplifying the Success Probability 

Algorithm 1 returns a set of influence at least $(1 - \frac{1}{e} - \epsilon)\,\mathrm{OPT}$ with probability at least 3/5. The failure
probability is due to Lemma 3.3: hypergraph $\mathcal{H}$ may not have sufficiently many edges after $R$ steps
have been taken by the simulation process in line 4 of the BuildHypergraph subprocedure. However,
this failure condition is detectable via repetition: we can repeat Algorithm 1 multiple times,
and use only the iteration that generates the most edges. The success rate can then be improved by
repeated invocation, up to a maximum of $1 - 1/n$ with $\log(n)$ repetitions (at which point the error
probability due to Lemma 3.4 becomes dominant).

We next note that, for any $\ell > 1$, the error bound in Lemma 3.4 can be improved to $1/n^{\ell}$ by
increasing the value of $R$ by a factor of $\ell$, since this error derives from Chernoff bounds. This would
allow the success rate of the algorithm to be improved up to a maximum of $1 - \frac{1}{n^{\ell}}$ by further repeated
invocation. To summarize, the success rate of the algorithm can be improved to $1 - \frac{1}{n^{\ell}}$ for any $\ell$, at
the cost of increasing the runtime of the algorithm by a factor of $\ell^2 \log(n)$.
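
The repetition scheme can be sketched as follows. This is a minimal illustration, assuming a hypothetical callable `run_once` that performs one independent run of the algorithm and returns the seed set together with the number of hyperedges it generated; neither name appears in the paper.

```python
from typing import Callable, List, Optional, Tuple


def amplify_by_edge_count(
    run_once: Callable[[], Tuple[List[int], int]],
    repetitions: int,
) -> Tuple[Optional[List[int]], int]:
    """Run the algorithm several times and keep the run that produced the most hyperedges."""
    best_seeds: Optional[List[int]] = None
    best_edges = -1
    for _ in range(repetitions):
        seeds, num_edges = run_once()
        if num_edges > best_edges:
            best_seeds, best_edges = seeds, num_edges
    return best_seeds, best_edges
```

The caller chooses `repetitions`; the text above suggests a number on the order of $\log(n)$.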

4 Approximate Influence Maximization in Sublinear Time 

In this section we describe a modified algorithm that provides a tradeoff between runtime and
approximation quality. For an arbitrary $\beta > 1$, our algorithm will obtain a $\Theta(1/\beta)$-approximation
to the influence maximization problem, in time $O\left(\frac{n \cdot a(\mathcal{G}) \log^3(n)}{\beta}\right)$, where $a(\mathcal{G})$ is the arboricity of
graph $\mathcal{G}$, with probability at least 3/5. The success rate can be improved by standard amplification
techniques; we discuss this in Section 4.1.

Our algorithm is listed as Algorithm 2 below. The intuition behind our construction is as
follows. We wish to find a set of nodes with high expected influence. One approach would be
to apply Algorithm 1 and simply impose a tighter constraint on the amount of time that can be
used constructing our hypergraph representation. This might correspond to reducing the value of
parameter $R$ by, say, a factor of $\beta$. Unfortunately, the precision of our sampling method does not
always degrade gracefully with $\beta$: if $\beta$ is sufficiently large, we may not have enough data to guess at
a maximum-influence node (even if we allow ourselves a factor of $\beta$ in the approximation ratio). In
these cases, the sampling approach fails to provide a good approximation.

Algorithm 2 Influence Maximization with Tradeoff

Require: Approximation parameter $\beta > 1$, directed weighted graph $\mathcal{G}$.
1: $R \leftarrow \frac{(24 \cdot 36)(n+m)\log^2(n)}{\beta}$
2: $\mathcal{H} \leftarrow$ BuildHypergraph($R$)
3: $S \leftarrow$ BuildSeedSet($\mathcal{H}$, $k$)
4: Choose $v \in V$ with probability proportional to degree in $\mathcal{H}$.
5: if TestInfluence($S$) $\ge$ TestInfluence($v$) then return $S$
6: else return $\{v\}$

TestInfluence($S$):
- $\tau$: our guess at $\beta$ times the influence of $S$.
- $L$: our guess at the realized influence size that contributes most to the expected influence.
1: for $\tau = n, n/2, n/4, \ldots, 1$ do
2:   for $L = n, n/2, n/4, \ldots, \tau$ do
3:     for $j = 1, \ldots, 32(L/\tau)\log(n)$ do
4:       Simulate influence in $\mathcal{G}$, starting from $S$, to a maximum of $L/\beta$ distinct nodes.
5:       Let $T_j$ be the set of nodes discovered.
6:     if $\sum_j |T_j| > 96(n/\beta)\log(n)$ then return $\tau/\beta$
7:     if at least $256\log(n)$ sets $T_j$ satisfy $|T_j| \ge L/\beta$ then return $\tau/\beta$
8: return 1

However, as we will show, our sampling fails precisely because many of the edges in our hypergraph
construction were large, and (with constant probability) this can occur only if many of the
nodes that make up those edges have high influence. In this case, we could proceed by selecting a
node from the hypergraph at random, with probability proportional to its hypergraph degree. We
prove that this procedure is likely to return a node of very high influence precisely in settings where
the original approach is not.

For this to work, we need to determine which of the two approaches will succeed (since,
in particular, the required bound on the hypergraph size is a function of OPT, which is unknown).
We therefore apply both methods, then directly estimate the influence of each solution. We must do
so carefully in order to keep our runtime small. To this end, the TestInfluence procedure generates a
rough sketch of the probability distribution over influence values by repeatedly growing depth-first
trees of various sizes. By balancing tree sizes with sampling precision, we are able to give an estimate
of influence up to a maximum of $n/\beta$. We then return whichever solution has the greater estimated
influence.
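
A single truncated simulation of the kind used by TestInfluence might look as follows. This is a sketch under assumptions, not the paper's code: the graph is represented as a dictionary mapping each node to a list of `(neighbor, probability)` pairs, the diffusion follows the independent cascade model, and the name `simulate_truncated` is hypothetical. The traversal is stack-based and stops as soon as `cap` distinct nodes have been discovered.

```python
import random
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Graph = Dict[Hashable, List[Tuple[Hashable, float]]]


def simulate_truncated(
    graph: Graph,
    seeds: Iterable[Hashable],
    cap: int,
    rng: random.Random,
) -> Set[Hashable]:
    """Grow one random cascade from `seeds`, stopping at `cap` distinct nodes."""
    discovered: Set[Hashable] = set()
    stack: List[Hashable] = []
    for s in seeds:
        if len(discovered) >= cap:
            break
        if s not in discovered:
            discovered.add(s)
            stack.append(s)
    while stack and len(discovered) < cap:
        u = stack.pop()
        for v, p in graph.get(u, []):
            # Each outgoing edge of u is examined once; it is live with probability p.
            if v not in discovered and rng.random() < p:
                discovered.add(v)
                stack.append(v)
                if len(discovered) >= cap:
                    break
    return discovered
```

Because the traversal halts at `cap` nodes, its cost depends on the explored subgraph rather than on the whole network, which is the property the runtime analysis relies on.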

We note that, since our final estimation step proceeds by repeatedly exploring subgraphs of an
input graph $\mathcal{G}$, its runtime will be tied to the arboricity of $\mathcal{G}$. Indeed, the arboricity is used to
explicitly bound the time needed to construct a depth-first tree of a given size.

Definition 1. The arboricity number of an undirected graph $\mathcal{G}$, denoted $a(\mathcal{G})$, is the minimum
number of spanning forests needed to cover all the edges of the graph. The arboricity of a directed,
weighted graph is the arboricity of the undirected, unweighted version of the graph.

Theorem 4.1. For any /? > 1, Algorithmic returns, with probability of at least 3/5, a node with 
expected influence at least min{^, ^} • OPT. Its runtime is 0( "'°^^^^°^ ^"^ ). 

Before proving Theorem 4.1, let us discuss its statement and some minor variations.

Discussion: It will turn out that the runtime of Algorithm 2 is dominated by subprocedure TestInfluence:
the time needed to estimate the influence of a given solution. Given access to an oracle that
estimates the influence of a given set, up to a maximum of $n/\beta$, the runtime of Algorithm 2 becomes
$O\left(\frac{(n+m)\log^2(n)}{\beta} + X(n, m, \beta)\right)$, where $X(n, m, \beta)$ is a bound on the runtime of the oracle. We provide
an implementation of such an oracle, based on growing spanning forests. Our analysis is given in
terms of the arboricity of the underlying graph and shows that $X(n, m, \beta) = O\left(\frac{n \cdot a(\mathcal{G})\log^3(n)}{\beta}\right)$. One
potential method for improving our analysis would be to find a more efficient implementation of an
influence-approximating oracle.

Another important note concerns the probabilistic guarantee we require from our algorithm.
Algorithm 2 returns a set that has high influence (in expectation over the random diffusion process),
with probability at least 3/5 over the random bits of the algorithm. The runtime of Algorithm 2 can
be improved if our goal is relaxed to returning a set of high influence in expectation both over the
random diffusion process and the randomness in the algorithm. For this relaxed solution concept, it
becomes unnecessary to test the influence of the two potential solutions: the algorithm can simply
choose between them at random. The expected influence of the set returned would then be at least
$\mathrm{OPT}/(2\beta)$. The runtime of this modified algorithm is $O\left(\frac{(m+n)\log^2(n)}{\beta}\right)$. □

Let us now turn to the proof of Theorem 4.1. Observe that when $\beta = O(\log n)$, the conditions of
Theorem 4.1 are satisfied by Algorithm 1 with (say) $\epsilon = 1/6$ (to get an approximation factor bigger
than 1/2), since its runtime is $O((m+n)\log(n)) \subseteq O\left(\frac{n \cdot a(\mathcal{G})\log^3(n)}{\beta}\right)$ (see Fact 4.5 in the appendix).
We can therefore assume that $\beta > \log(n)$.

Our analysis proceeds via two cases, depending on whether $\mathcal{H}$ has sufficiently many edges as a
function of OPT. We first show that, subject to $\mathcal{H}$ having many edges, set $S$ from line 3 is likely to
have high influence. This follows the analysis from Theorem 3.1 almost exactly.

Lemma 4.2. Suppose that m{'H) > ^^Qpy"^ • Then, with probability at least 1 — ^, set S satisfies 
Eg[/(S')] > ^OPT, with probability taken over randomness in %. 

Proof. This follows by applying Lemma[33]with e = g, followed by the analysis of BuildSeedSet('H, k) 
from the proof of Theorem 13.11 □ 

Note that $\beta$ does not appear explicitly in the statement of Lemma 4.2. The (implicit) role of $\beta$
in Lemma 4.2 is that as $\beta$ becomes large, Algorithm 2 uses fewer steps to construct hypergraph $\mathcal{H}$
and hence the condition of the lemma is less likely to be satisfied.

We next show that if $m(\mathcal{H})$ is small, then node $v$ from line 4 is likely to have high influence.
Recall our assumption that $\beta > \log(n)$.

Lemma 4.3. Suppose that $m(\mathcal{H}) < \frac{24 n \log n}{\mathrm{OPT}}$. Then, with probability at least 2/3, node $v$ satisfies
$\mathbb{E}_{\mathcal{G}}[I(v)] \ge \mathrm{OPT}\log(n)/\beta$, with probability taken over randomness in $\mathcal{H}$.

Proof. Let random variable $X$ denote the number of times that a node with influence at most
$\mathrm{OPT}\log(n)/\beta$ was added to a hyperedge of $\mathcal{H}$. Since $\mathcal{H}$ has fewer than $\frac{24 n \log n}{\mathrm{OPT}}$ edges, the expected
value of $X$ is at most

\[
\mathbb{E}[X] \le \frac{24 n \log n}{\mathrm{OPT}} \sum_{u} \frac{1}{n}\min\{\mathbb{E}_{\mathcal{G}}[I(u)], \mathrm{OPT}\log(n)/\beta\} \le \frac{24 n \log^2(n)}{\beta}.
\]

Markov's inequality then gives that $\Pr\left[X > \frac{(24 \cdot 6) n \log^2(n)}{\beta}\right] < \frac{1}{6}$. Conditioning on this event not occurring, we
have that at most $\frac{(24 \cdot 6) n \log^2(n)}{\beta}$ of the nodes touched by BuildHypergraph have influence less than
$\mathrm{OPT}\log(n)/\beta$. Since at least $\frac{(24 \cdot 36) n \log^2(n)}{\beta}$ nodes were touched in total, the probability that node
$v$ from line 4 has influence less than $\mathrm{OPT}\log(n)/\beta$ is at most 1/6. The union bound then allows us
to conclude that $v$ has $\mathbb{E}_{\mathcal{G}}[I(v)] \ge \mathrm{OPT}\log(n)/\beta$ with probability at least $1 - (1/6 + 1/6) \ge 2/3$. □

Next, we show that procedure TestInfluence($S$) returns a sufficiently good estimate of the influence
of a given set of nodes. The proof, which is somewhat technical, is deferred to Section 4.3.

Lemma 4.4. Given a set of nodes S procedure Testlnfluence returns a value from the range [x/\og{n),x] 
with probability at least ^ — where x = min{Eg[/(S')], n//3}. 

Finally, we show that the runtime of Algorithm 2 is at most $O\left(\frac{n \cdot a(\mathcal{G})\log^3(n)}{\beta}\right)$. The following fact
about arboricity will be particularly useful.

Fact 4.5 (Nash-Williams [35]). Given any graph $G$ and subgraph $H$ of $G$ (containing at least two
nodes), $\frac{m_H}{n_H - 1} \le a(G)$, where $n_H$ and $m_H$ denote the number of nodes and edges of $H$.

As noted by Nash-Williams, this characterization of arboricity is tight, as there is always one such
$H$ with $\left\lceil \frac{m_H}{n_H - 1} \right\rceil = a(G)$. We are now ready to bound the runtime of Algorithm 2.
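
The Nash-Williams characterization states that the arboricity of a graph equals the maximum, over subgraphs $H$ with at least two nodes, of $\lceil m_H/(n_H - 1) \rceil$, where $n_H$ and $m_H$ count the nodes and edges of $H$. For very small graphs this can be evaluated by brute force, as in the sketch below. The function name `arboricity_nash_williams` and the edge-list input format are assumptions made for this example; the enumeration is exponential in $n$ and is meant only to make the formula concrete. It suffices to consider induced subgraphs, since adding edges on a fixed node set can only increase the ratio.

```python
from itertools import combinations
from typing import List, Tuple


def arboricity_nash_williams(n: int, edges: List[Tuple[int, int]]) -> int:
    """Arboricity of a simple undirected graph on nodes 0..n-1, by exhaustive search."""
    best = 0
    for size in range(2, n + 1):
        for subset in combinations(range(n), size):
            nodes = set(subset)
            # Count edges with both endpoints inside the chosen node set.
            m_h = sum(1 for u, v in edges if u in nodes and v in nodes)
            # Integer ceiling of m_h / (size - 1).
            best = max(best, -(-m_h // (size - 1)))
    return best
```

For instance, the complete graph on four nodes has six edges and arboricity 2, while any tree has arboricity 1.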

Lemma 4.6. Algorithm 2 runs in time $O\left(\frac{n \cdot a(\mathcal{G})\log^3(n)}{\beta}\right)$.

Proof. Similarly to the analysis of Algorithm 1, lines 1-3 take time at most $R = O\left(\frac{(n+m)\log^2(n)}{\beta}\right)$. Fact 4.5
implies that this is $O\left(\frac{n \log^2(n) \cdot a(\mathcal{G})}{\beta}\right)$.

We must now show that calling TestInfluence() costs at most $O\left(\frac{n \cdot a(\mathcal{G})\log^3(n)}{\beta}\right)$, which completes
the proof. On each choice of $L$ and $\tau$, TestInfluence() grows several trees, where the sum over those
tree sizes is at most $O\left(\frac{n}{\beta}\right)\log n$. By the Nash-Williams theorem [35], each subgraph explored during
this process has arboricity at most $a(\mathcal{G})$.

Thus the total time to build the trees (for a given $\tau$ and $L$) is at most $O\left(\frac{n \cdot a(\mathcal{G})}{\beta}\log(n)\right)$. To implement
the test on line 7, for each value of $L$, one needs to keep track of the number of trees with size at
least $\frac{L}{\beta}$. This can easily be done online by keeping a counter, updating the counter after each tree is
realized. Since we also iterate over $\log(n)$ values for $L$ and $\tau$, the total runtime is $O\left(\frac{n \cdot a(\mathcal{G})\log^3(n)}{\beta}\right)$. □

We are now ready to complete the proof of Theorem 14. 1[ 

Proof of Theorem \4. 1\ Lemma [4. 2 1 and Lemma 231 imply that, with probability at least 2/3 — 1/n^ > 
3/5 (for $n > 5$), one of $S$ or $\{v\}$ has influence at least $\mathrm{OPT}\log(n)/\beta$ (recalling that $\beta > \log n$). Since
TestInfluence determines the influence of each set up to a potential under-valuation of a factor of
$\log(n)$, the set for which TestInfluence returns the highest estimate must therefore have influence at
least $\frac{\mathrm{OPT}\log(n)}{\beta} \cdot \frac{1}{\log(n)} = \mathrm{OPT}/\beta$. The required bound on the runtime of Algorithm 2 follows directly
from Lemma 4.6. □

4.1 Amplifying the Success Probability 

Algorithm 2 returns, with probability 3/5, a set that approximates the maximum expected influence
of a node in $\mathcal{G}$. We note that this returned set comes with an estimate of the optimal influence in
the network; the "failure" conditions correspond to selecting this estimate incorrectly. To amplify
success probability, we could return this estimate along with the set $S$. One could then call the
algorithm multiple times; each successful invocation would generate an estimate at least $\mathrm{OPT}/\beta$,
whereas failed invocations would potentially generate smaller estimates. Amplification of success
probability then corresponds to accepting whichever output is associated with the highest estimate.
This would allow success probability to be raised up to $1 - 1/n$, at which point sampling error in
Lemma 4.2 becomes the dominant error factor.

As in our original algorithm, the success probability in Lemma 4.2 can be improved to $1 - \frac{1}{n^{\ell}}$ by
increasing the value of $R$ by a factor of $\ell$, since this error derives from Chernoff bounds. This would
allow the success rate of the algorithm to be improved up to a maximum of $1 - \frac{1}{n^{\ell}}$ by further repeated
invocation. To summarize, the success rate of the algorithm can be improved to $1 - \frac{1}{n^{\ell}}$ for any $\ell$, at
the cost of increasing the runtime of the algorithm by a factor of $\ell^2 \log(n)$.



4.2 A Lower Bound 

We now provide a lower bound on the time it takes to compute a /3-approximation for the maximum 
expected influence problem. In particular, for any given budget k, at least Vt{n/P) queries are 
required to obtain approximation factor with constant probability. 

Theorem 4.7. Let $0 < \epsilon < 1$, $\beta > 1$ be given. Any (possibly randomized) algorithm for the maximum
influence problem that has runtime of $\frac{m+n}{24\beta\min\{k,\beta\}}$ cannot return, with probability at least $1 - \frac{1}{e} - \epsilon$,
a set of nodes with approximation ratio better than $\frac{1}{\beta}$.

Proof. Note first that for a graph consisting of $n$ singletons, an algorithm must return at least $k/\beta$
nodes to obtain an approximation ratio of $\frac{1}{\beta}$. Doing so in at most $n/(2\beta^2)$ queries requires that
$2k/\beta \le n/\beta^2$, which implies $2k\beta \le n$. We can therefore assume $2k\beta \le n$ for the remainder of the
proof.

Consider the behavior of a randomized algorithm $A$. Assume for notational simplicity that $\beta$ is
an integer. We will build a family of lower bound graphs, one for each value of $n$ (beginning from
$n = \beta + 1$); each graph will have $m \le n$, so it will suffice to demonstrate a lower bound of $\frac{n}{12\beta\min\{k,\beta\}}$.

For a given value $\beta$, the graph is made from $k$ components of size $2\beta$ and $n - 2k\beta$ singleton
components (recall that $2k\beta \le n$). If algorithm $A$ returns nodes from $\ell$ of the $k$ components of size
$2\beta$, it achieves a total influence of $2\ell\beta + (k - \ell)$. Thus, to attain approximation factor better than
$\frac{1}{\beta}$, we must have $2\ell\beta + (k - \ell) \ge \frac{1}{\beta} 2k\beta$, which implies $\ell \ge \frac{k}{2\beta - 1}$ for any $\beta > 1$.

Suppose k > 12/3. The condition I > 2^-1 implies that at least 2^-1 °f large components 
must be queried by the algorithm, where each random query has probability of hitting a large 
component. If the algorithm makes fewer than queries, then the expected number of components 

hit is • = Multiplicative chernoff bounds then imply that the probability hitting more 

i, — ^ .A I A. 

than 2^ components is no more than 1 — e >1 — 1/e, a contradiction. 

If A; < 12/3 then we need that ^ > 1, which occurs only if the algorithm queries at least one of 
the /c/3 vertices in the large components. With queries, for n large enough, this happens with 

probability smaller than ^ + e, a contradiction. 

We conclude that, in all cases, at least $\frac{n}{12\beta\min\{k,\beta\}}$ queries are necessary to obtain approximation
factor better than $\frac{1}{\beta}$ with probability at least $1 - \frac{1}{e} - \epsilon$, as required.

Finally, we note that our construction can be modified to apply to non-sparse networks, as follows.
For any $d \le n$, we can augment our graph by overlaying a $d$-regular graph with exponentially small
weight on each edge. This does not significantly impact the influence of any set of nodes, but it
increases the time to determine whether a node is in a large component by a factor of $O(d)$ (as edges
must be traversed until one with non-exponentially-small weight is found). Thus, for each $d \le n$, we
have a lower bound of $\frac{m+n}{24\beta\min\{k,\beta\}}$ for networks with $m = nd$. □

Discussion: The lower bound construction of Theorem 4.7 is tailored to the query model considered
in this paper. In particular, we assume that vertices are not sorted by degree, component size, etc.
However, the construction can be easily modified to be robust to various changes in the model, by
(for example) adding edges with small weight so that the exhibited network $\mathcal{G}$ becomes connected
and/or $d$-regular for some fixed $d$ (independent of $n$). □



4.3 Proof of Lemma 4.4

First recall the statement of Lemma [4. 4[ We must show that, given a set of nodes S, with probability 
at least ^ — procedure Testlnfluence returns a value from the range [x/ log(n), x], where x = 
$\min\{\mathbb{E}_{\mathcal{G}}[I(S)], n/\beta\}$. The key to algorithm TestInfluence is to efficiently decide, for a given set $S$
and potential influence value $\tau \le n/\beta$, whether $S$ has influence at least $\tau\log(n)$ or less than $\tau$. We
begin with two useful observations that will assist us in this task.

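Lemma 4.8. For any $v \in \mathcal{G}$, there exists some $t$, $1 \le t \le \log n$, such that $2^t \Pr_{G \sim \mathcal{G}}[I_G(v) \ge 2^t] \ge
\mathbb{E}_{G \sim \mathcal{G}}[I_G(v)]/\log(n)$.

Proof.

\[
\begin{aligned}
\mathbb{E}_{G \sim \mathcal{G}}[I_G(v)] &= \sum_{y=1}^{n} \Pr_{G \sim \mathcal{G}}[I_G(v) \ge y] \\
&= \sum_{t=0}^{\log n} \sum_{y=2^t}^{2^{t+1}-1} \Pr_{G \sim \mathcal{G}}[I_G(v) \ge y] \\
&\le \sum_{t=0}^{\log n} 2^t \Pr_{G \sim \mathcal{G}}[I_G(v) \ge 2^t]
\end{aligned}
\]

and hence $\mathbb{E}_{G \sim \mathcal{G}}[I_G(v)] \le \log(n) \max_t\{2^t \Pr_{G \sim \mathcal{G}}[I_G(v) \ge 2^t]\}$, as required. □

Lemma 4.9. For any $v \in \mathcal{G}$ and any $1 \le t \le \log n$, $2^t \Pr_{G \sim \mathcal{G}}[I_G(v) \ge 2^t] \le \mathbb{E}_{G \sim \mathcal{G}}[I_G(v)]$.

Proof.

\[
\begin{aligned}
\mathbb{E}_{G \sim \mathcal{G}}[I_G(v)] &= \sum_{y=1}^{n} y \Pr_{G \sim \mathcal{G}}[I_G(v) = y] \\
&\ge \sum_{y=2^t}^{n} 2^t \Pr_{G \sim \mathcal{G}}[I_G(v) = y] \\
&= 2^t \Pr_{G \sim \mathcal{G}}[I_G(v) \ge 2^t].
\end{aligned}
\]

□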
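
The two lemmas can be checked numerically on a small example. The sketch below is illustrative only: the function name `dyadic_profile` and the toy distribution are hypothetical. For an influence distribution supported on $\{1, \ldots, n\}$, it computes the mean together with the quantities $2^t \Pr[I \ge 2^t]$ for $t = 0, \ldots, \lfloor \log_2 n \rfloor$. The check of the first lemma divides by the number of dyadic scales, $\lfloor \log_2 n \rfloor + 1$, which is the form that follows directly from the sum in its proof.

```python
from typing import Dict, List, Tuple


def dyadic_profile(pmf: Dict[int, float]) -> Tuple[float, List[float]]:
    """Return (mean, [2^t * Pr[I >= 2^t] for t = 0..floor(log2 n)]) for a pmf on 1..n."""
    n = max(pmf)
    mean = sum(y * p for y, p in pmf.items())
    scales = n.bit_length()  # number of values t with 2^t <= n
    profile = [
        (2 ** t) * sum(p for y, p in pmf.items() if y >= 2 ** t)
        for t in range(scales)
    ]
    return mean, profile


# Toy distribution: influence is 1 with probability 0.9 and 64 with probability 0.1.
mean, profile = dyadic_profile({1: 0.9, 64: 0.1})
# Every scale is bounded by the mean (Markov-type bound of Lemma 4.9).
markov_ok = all(value <= mean + 1e-12 for value in profile)
# Some scale captures at least a 1/(number of scales) fraction of the mean.
some_scale_ok = max(profile) >= mean / len(profile) - 1e-12
```

In this example the mean is dominated by the rare large outcome, and the scale $2^6$ recovers most of it, which is the role played by the guess $L$ in TestInfluence.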

Note that we think of $L$ in algorithm TestInfluence($S$) as a guess at the value of $2^t$ from the
statement of Lemma 4.8. We are now ready to proceed with the proof of Lemma 4.4.

Consider a given iteration of TestInfluence, corresponding to a value of $\tau$. We think of $\tau$ as a
guess at the value of $\mathbb{E}_{\mathcal{G}}[I(S)]\beta$. We will first show that, with high probability, we do not return on
an iteration in which $\mathbb{E}_{G \sim \mathcal{G}}[I_G(S)] < \tau/\beta$. We must consider the return conditions on line 6 and line 7.

Suppose EG'^g[/G(5)] < r//3. Consider the return condition on line 7. For set S, PrG^g[/G!(5) > 
L//3] < {t/L) for each choice of L, by Lemma 14.91 Thus, for any given choice of L, the probability 
that the event [Ig{S) > L//3] occurs more than 2561og(n) times in (L/r)32 log(n) trials is (by 
multiplicative Chernoff bounds) at most e~^^°^^^^ = 1/n^. Taking the union bound over all values 
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of L, we conclude that with probabihty at least l — \/v? we will not return on an iteration for which 
W.G^g[lG{S)] <t//3 on line 7. 

Next consider the condition on line 6. Suppose ^G-^gVciS)] < t/P, and pick some L' G 
{n, n/2, . . . , r}. As above, PrG^g[/G'(S') > L' / P] < (t/L'). Consider the sets of nodes built for 
node S on the iteration corresponding to some choice of L. If L < L' , then each set Tj will have size 
at most L' //3. For any L > L', we know that by the multiplicative chernoff bound, the probability 
that event [Ig{S) > L'/P] occurs more than 2(L/r)32 log(?i) • {t/L') = 64(L/L') log(n) times is at 
most e-'"°g(") = 1/n''. Taking the union bound over all choices of L and L', we have that with 
probability at least 1 — for each L and L' < L, fewer than 2(L/L')32 log(n) trees built on 

iteration L have size greater than L'//3. Thus, conditioning on this event, we have that the sum of 
tree sizes is less than 

^(27/3)2(L/2*)log2(n) < 2^321og3(n). 

t=i ^ 

Taking the union bound over all values of L, we conclude that with probability at least 1 — we 
will not return on an iteration for which E,G^g[lGiS)] < ^, on line 6. 

We will now show that, in an iteration in which rlog(n)//3 < Eg[/(S')], the algorithm returns 
with high probability. Suppose S has KGn^g[lG{S)] > (r//3) log(n). Then, by Lemma l48| there exists 
some L > T, L a power of 2, such that Prc^g [/g'(S') > L/P] > jj^log{n). Multiplicative chernoff 
bounds imply that, during the 32(L/r) log(n) trees built for set S, the probability that fewer than 
^321og(n) = 161og(n) reach size L/(3 is at most e"^^"^*-"-* = Thus, assuming our algorithm 

does not return on line 6, it will return on line 7 with probability at least 1 — 0(l/n^), as required. 
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A Concentration Bounds 

Lemma A.1. (Multiplicative Chernoff Bound) Let $X_i$ be $n$ i.i.d. Bernoulli random variables with
expectation $\mu$ each. Define $X = \sum_{i=1}^{n} X_i$. Then,
For $0 < \lambda < 1$, $\Pr[X < (1 - \lambda)\mu n] < \exp(-\mu n \lambda^2/2)$.
For $0 < \lambda < 1$, $\Pr[X > (1 + \lambda)\mu n] < \exp(-\mu n \lambda^2/4)$.
For $\lambda \ge 1$, $\Pr[X > (1 + \lambda)\mu n] < \exp(-\mu n \lambda/2)$.
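
To make the use of these bounds concrete, the following sketch evaluates the right-hand sides of Lemma A.1 as stated and compares them with exact binomial tails on small instances. The helper names (`chernoff_lower_tail`, `chernoff_upper_tail`, `binomial_tail_at_least`) are hypothetical, and the parameter `mean` stands for $\mu n$.

```python
from math import comb, exp, log


def chernoff_lower_tail(mean: float, lam: float) -> float:
    """Bound on Pr[X < (1 - lam) * mean] for 0 < lam < 1, as stated in Lemma A.1."""
    if not 0 < lam < 1:
        raise ValueError("lam must lie strictly between 0 and 1")
    return exp(-mean * lam ** 2 / 2)


def chernoff_upper_tail(mean: float, lam: float) -> float:
    """Bound on Pr[X > (1 + lam) * mean], using the case of Lemma A.1 that applies."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if lam < 1:
        return exp(-mean * lam ** 2 / 4)
    return exp(-mean * lam / 2)


def binomial_tail_at_least(n: int, p: float, k: int) -> float:
    """Exact Pr[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# With mean 24 * log(n) / eps^2 and deviation eps, the lower-tail bound is n^(-12).
n_nodes, eps = 1000, 0.1
bound = chernoff_lower_tail(24 * log(n_nodes) / eps ** 2, eps)
```

The last two lines reproduce the calculation used in the analysis of $D_S$: an expected degree of at least $24\epsilon^{-2}\log n$ yields a failure probability of at most $n^{-12}$ for the lower tail.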

Lemma A.2. (Additive Chernoff Bound) Let $X_i$ be $n$ i.i.d. Bernoulli random variables with expectation
$\mu$ each. Define $X = \sum_{i=1}^{n} X_i$. Then, for $\lambda > 0$,
$\Pr[X < \mu n - \lambda] < \exp(-2\lambda^2/n)$.
$\Pr[X > \mu n + \lambda] < \exp(-2\lambda^2/n)$.